Add Nemotron Labs Diffusion by smdesai · Pull Request #310 · ml-explore/mlx-swift-lm

smdesai · 2026-05-22T23:02:21Z

Proposed changes

This PR adds support for NVIDIA's Nemotron-Labs-Diffusion model, which supports autoregressive, diffusion, and self-speculating decoding modes.

Checklist

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

nemotron-labs-diffusion.mov

davidkoski · 2026-05-26T17:57:24Z

I wonder if the Nemotron and LoRA code could be in two PRs? They look independent.

davidkoski · 2026-05-26T17:58:19Z

+private func nlsLlama4AttentionScale(
+    start: Int, stop: Int, beta: Float, maxPositionEmbeddings: Int
+) -> MLXArray {
+    let positions = MLXArray(Int32(start) ..< Int32(stop))


use arange(start, stop) instead -- that will be done on the GPU (may not be a big deal)

smdesai · 2026-05-27

Yes, this is cleaner, and I can make the change.

As for splitting the Nemotron and LoRA code into two separate PRs: yes, they're independent.

LoRA changes → standalone, mergeable on their own

Nemotron port → depends on the LoRA changes

I'm happy to split them if that's cleaner.

davidkoski · 2026-05-27

It's a little easier to review smaller pieces, and I can get the LoRA change reviewed and merged faster than the Nemotron code.

smdesai · 2026-05-27

PR #316 now contains the LoRA changes. This PR will be updated accordingly once that's merged.

smdesai · 2026-05-27

Using arange(start, stop) now.

davidkoski · 2026-05-26T18:04:09Z

+    /// Block-wise diffusion (parallel) decoding.
+    ///
+    /// Mirrors `generate(...)` in `modeling_nemotron_labs_diffusion.py` with
+    /// `causal_context = False`. The prompt is causally prefilled into a
+    /// shared KV cache once. Each block of `blockLength` tokens is initialized
+    /// to `mask_token_id` and refined in up to `blockLength` denoising steps:
+    /// at each step the block is forwarded bidirectionally, attending against
+    /// the cached prefix without re-running it. After each step the cache is
+    /// trimmed by `blockLength` so the next step writes over the same slots.
+    /// Once a block stabilizes, we run one causal forward over its finalized
+    /// tokens to commit their K/V into the cache for subsequent blocks.
+    public func diffusionGenerate(


Is this a generic technique required for diffusion models? I wonder if we need a new DiffusionLanguageModel protocol and this becomes generic code?

Same question for arGenerate

What happens if you call the standard generate loop with callAsFunction? Are you losing functionality?
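
To make the cache handling in the doc comment above concrete, here is a minimal sketch of the bookkeeping only. It assumes a toy cache that just counts positions; `ToyCache` and `simulateBlockDecoding` are hypothetical names for illustration and are not part of the PR or of MLX. The real implementation stores K/V tensors and runs model forwards at each numbered step.

```swift
/// Toy stand-in for a KV cache that only tracks how many positions it holds.
/// Hypothetical type for illustration; a real cache stores K/V tensors.
struct ToyCache {
    private(set) var offset = 0
    mutating func append(_ count: Int) { offset += count }
    mutating func trim(_ count: Int) { offset -= min(count, offset) }
}

/// Walks through the cache bookkeeping for block-wise decoding and returns
/// the cache offset observed after each block has been committed.
func simulateBlockDecoding(
    promptLength: Int, blockLength: Int, blocks: Int, stepsPerBlock: Int
) -> [Int] {
    var cache = ToyCache()
    var offsetsAfterCommit: [Int] = []

    // 1. Causal prefill of the prompt, done once.
    cache.append(promptLength)

    for _ in 0 ..< blocks {
        let prefix = cache.offset
        for _ in 0 ..< stepsPerBlock {
            // 2. A bidirectional forward over the block writes blockLength
            //    entries after the cached prefix...
            cache.append(blockLength)
            // 3. ...which are trimmed so the next step reuses the same slots.
            cache.trim(blockLength)
            assert(cache.offset == prefix)
        }
        // 4. One causal forward over the finalized block commits its K/V.
        cache.append(blockLength)
        offsetsAfterCommit.append(cache.offset)
    }
    return offsetsAfterCommit
}
```

In this sketch the cache grows by exactly `blockLength` per finished block, regardless of how many denoising steps the block needed, which is the property the trim-after-each-step scheme is meant to provide.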

smdesai · 2026-05-27

This is the first diffusion model in the repo, so it's hard to say whether the technique is generic. A DiffusionLanguageModel protocol could exist, but it would need to be carefully scoped and would be hard to validate until there's a second or third implementation.

The scoring/commit logic for Nemotron has some quirks. For example, "fallback to a single most confident token if no candidate beats threshold" is a model-specific decision. A generic protocol may freeze one policy and possibly limit future ports. I'd hold off on this until we have a second model and then revisit.

arGenerate was provided for symmetry with diffusionGenerate / linearSpecGenerate. It duplicates the work that LLMModel and TokenIterator already do. It can be removed, with autoregressive generation going through ChatSession instead; this is what the demo app does, so it is the same as the standard generate loop with callAsFunction.

For diffusion and self-speculation, we'd lose almost everything that makes them what they are, as the standard generate loop has no idea about bidirectional logits or block-wise denoising.

Removing arGenerate makes sense unless you'd like to keep it for symmetry purposes, perhaps documented and commented as such. Happy either way.
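
The fallback rule quoted above ("fallback to a single most confident token if no candidate beats threshold") can be sketched in plain Swift as follows. This is a sketch under stated assumptions, not the PR's code: `positionsToCommit` and its parameters are hypothetical names, and confidences are plain `Float` arrays rather than MLX arrays.

```swift
/// Hypothetical helper, not taken from the PR: picks which masked positions
/// in a block to commit after one denoising step.
///
/// - Parameters:
///   - confidences: confidence of the top candidate token at each position
///   - isMasked: whether each position is still masked
///   - threshold: confidence a candidate must exceed to be committed
/// - Returns: indices to commit, in ascending order
func positionsToCommit(
    confidences: [Float], isMasked: [Bool], threshold: Float
) -> [Int] {
    precondition(confidences.count == isMasked.count)

    // Only positions that are still masked are candidates.
    let candidates = confidences.indices.filter { isMasked[$0] }
    guard !candidates.isEmpty else { return [] }

    // Commit every candidate whose confidence beats the threshold.
    let confident = candidates.filter { confidences[$0] > threshold }
    if !confident.isEmpty { return confident }

    // Fallback: no candidate beat the threshold, so commit the single most
    // confident one. At least one position is committed per step.
    let best = candidates.max { confidences[$0] < confidences[$1] }!
    return [best]
}
```

Because this version always commits at least one position per step, a block of N masked tokens finishes in at most N steps. A different model might prefer another policy, such as committing a fixed number of tokens per step, which is one reason a shared protocol is hard to scope from a single implementation.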

davidkoski · 2026-05-27

Yeah, as the first diffusion model I agree we are in uncharted territory. I think adding all of it with a comment explaining why it is that way will be fine. If/when we pick up another diffusion model we can revisit.

smdesai · 2026-05-27

arGenerate is now commented and documented, and I also added a comment about the DiffusionLanguageModel protocol.

smdesai · 2026-06-01

@davidkoski Changes applied, so this is ready for review.

davidkoski reviewed May 26, 2026

smdesai mentioned this pull request May 27, 2026

LoRA: runtime toggle and PEFT adapter loader #316

Merged

4 tasks

Sachin Desai added 2 commits May 27, 2026 15:18

add nemotron labs diffusion

a86575a

clarify arGenerate and diffusionGenerate intent

425becc

smdesai force-pushed the nemotron-labs-diffusion branch from e0934b4 to 425becc on May 27, 2026 22:52

smdesai wants to merge 2 commits into ml-explore:main from smdesai:nemotron-labs-diffusion